Abstract


This paper describes two ways to allocate a hierarchical
(i.e., tree structured) data base. Both methods address nodes
using traces. A trace is an n-tuple of indices, [x(1),...,x(n)],
which describes the unique path from the root of the tree to the
node being addressed. That is, one takes the x(1)-th branch from
the root, followed by the x(2)-th branch from the next node,
etc., until the path is completed. The last node on the path is
the one being addressed.

Given a set of traces which represent a set of nodes in
a tree, the problem is to allocate them efficiently on a file. We
approach the problem by finding ways of mapping n-tuples (i.e.,
traces) onto natural numbers (i.e., file indices). The two
methods investigated are hashing traces into addresses and
enumerating traces in some predetermined order. Some experimental
results on hashing a set of traces based on a local IMS data base
are presented. Several enumeration schemes are described and
analyzed, including one which adapts to a changing distribution
of nodes within the tree.
1. Introduction


A tree structured data base, often called a hierarchical
data base, is a collection of trees whose internal nodes and/or
leaves contain data. Many commercial data base management systems
structure their data in this way, for example, IMS [IBM Corp.],
TDMS [Bleier & Vorhaus 1968], and System 2000 [1970]. The claim
is that data about the real world can be structured naturally as
trees. We do not intend to enter the controversy of whether a
tree structured data base is really the most appropriate for
commercial applications. Rather, our aim is to analyze this
organization with an eye toward finding efficient
implementations. If efficient implementations are found, then
presumably this will make the tree structure a more attractive
choice.

The tree structure in a hierarchical data base is
generally implemented using pointers and physical contiguity
[Winkler 1970]. For example, one possible organization is for
every node to point to its next youngest brother and to its
oldest child. By adding extra pointers, such as a pointer to a
node's parent, redundant structural information is retained so
that certain queries to the data base can be serviced faster.
However, pointers are not the only way of reproducing the
structure of a tree. For example, brothers may be stored in
contiguous locations to reduce the number of pointers. Knuth
[1963] cites a number of simple methods (e.g., preorder) to map
binary trees onto linear storage, where the position of a node in
linear storage identifies its position within the tree.
Unfortunately, methods such as preorder are usually inefficient
for some operations, such as insert and delete. In this paper we
will examine efficient ways of reproducing a tree in linear
memory without the use of pointers.

This introductory section continues with a description
of a notation for discussing tree structured data bases. The
discussion is then limited to a specific addressing technique,
called "traces", which identifies the location of a node within
the tree. General properties of allocation schemes which
implement traces are outlined. A model for data base behaviour is
proposed which will be used in later sections to analyse specific
schemes.


1.1 Tree Structured Data Bases

In  the  following  discussion  we  assume  the  reader  has 
repeating group or segment type). The actual data is stored in
the data base trees in nodes (sometimes called records, elements,
or segments), each node having an associated type from the
definition tree. A node of type T1 in the data base can have a
child of type T2 if and only if T1 is the parent of T2 in the
definition tree. There are no logical constraints on when one can
insert a node whose type is defined by the root node of the
definition tree (i.e., the root type), since such a node has no
parents. The data base can be thought of as a single tree by
assuming the existence of an imaginary root node on level 0 whose
children are of the root type.

The above concepts are illustrated in figures 1-1, 1-2,
and 1-3.
Figure 1-2 Definition Tree

Figure 1-3 An Interpretation for fig. 1-2

1.2 Accessing the Data Base

To identify nodes within the tree, we adopt a logical
addressing scheme called traces [Lowenthal 1971]. A filial set is
the set of all children of a single type which have the same
parent. We impose an arbitrary but fixed ordering on each filial
set. Thus, if there are m members of some filial set, then each
node in that set has a unique index in the range one to m. The
indices could be specified, say, by the order in which the nodes
were inserted into the data base.

Now, every node can be uniquely identified by
specifying, first, its type and second, the index of itself and
each of its ancestors in their respective filial sets. A trace is
a type followed by a tuple of indices. If type T is on level n of
the definition tree, then a trace for a type T node is an n-tuple
T[x(1),x(2),...,x(n)], where x(i) is the index of its level i
ancestor in the filial set of which that ancestor is a member.
So, to find the type T node [x(1),x(2),...,x(n)], we identify the
x(1)-th node on level 1 and at each succeeding level, i (1 ≤ i < n),
branch to the x(i+1)-th child of the preceding node.

The reader should convince himself that a trace
T[x(1),...,x(n)] defines a unique path from the root to a type T
node. In addition, every type T node (on level n) has a unique
trace T[x(1),...,x(n)]. For example, the traces for the nodes in
the data base of figure 1-1 are listed in figure 1-4.

Figure 1-4 Traces for nodes in fig. 1-1

Nearly every storage structure for trees has at least an
implicit ordering of children within filial sets. For example,
assume every node points to its oldest child of each type and to
the next youngest child in the filial set of which it is a
member. Then given a trace, it is possible to find the node
described by that trace by following pointers in the obvious,
albeit tedious, way.
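
The pointer-following lookup just described can be sketched as
follows. This is only an illustration under stated assumptions: the
names Node, add_child and find_by_trace are hypothetical, and the
list type_path (the type on each level of the path being followed)
is an assumed way of supplying the type information that a trace
carries.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    """A data base node holding the two kinds of pointers described above."""
    type_name: str
    data: object = None
    # One "oldest child" pointer per child type (hypothetical layout).
    oldest_child: Dict[str, "Node"] = field(default_factory=dict)
    # Pointer to the next youngest member of the same filial set.
    next_sibling: Optional["Node"] = None


def add_child(parent: Node, child: Node) -> Node:
    """Append child as the youngest member of its filial set under parent."""
    first = parent.oldest_child.get(child.type_name)
    if first is None:
        parent.oldest_child[child.type_name] = child
    else:
        while first.next_sibling is not None:
            first = first.next_sibling
        first.next_sibling = child
    return child


def find_by_trace(root: Node, type_path: Sequence[str],
                  trace: Sequence[int]) -> Optional[Node]:
    """Walk from the imaginary root to the node named by the trace.

    type_path[i] is the type on level i+1 and trace[i] is the 1-based
    index x(i+1) within the corresponding filial set.
    """
    if len(type_path) != len(trace):
        raise ValueError("type_path and trace must have the same length")
    node = root
    for type_name, index in zip(type_path, trace):
        node = node.oldest_child.get(type_name)   # oldest child of that type
        for _ in range(index - 1):                # step to the index-th sibling
            if node is None:
                return None
            node = node.next_sibling
        if node is None:
            return None
    return node
```

Each index costs a walk along a sibling chain, which is the tedium
the text refers to and the cost a pointerless mapping tries to avoid.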

In addition to locating a node within the tree, a trace
also locates all of the node's ancestors. By deleting the
trailing indices of the trace and changing the type accordingly,
the traces for a node's father, grandfather, etc. are obtained.
This extra information content makes traces quite convenient for
the evaluation of certain user queries. Lowenthal [1971] shows
that by using traces a fairly elaborate query language can be
serviced in a clean way.
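
As a small illustration of how ancestor traces fall out of a trace,
the sketch below truncates a trace one index at a time. The function
name ancestor_traces and the list type_path (the type on each level
of the path) are assumptions made for the example.

```python
def ancestor_traces(type_path, trace):
    """Return (type, trace) pairs for the father, grandfather, and so on.

    type_path[i] is the type on level i+1 of the definition tree path;
    trace is the tuple of indices of the node itself.
    """
    result = []
    for k in range(len(trace) - 1, 0, -1):
        # Drop the trailing indices and switch to the type on level k.
        result.append((type_path[k - 1], tuple(trace[:k])))
    return result


# The node T3[2,1,4] has father T2[2,1] and grandfather T1[2].
print(ancestor_traces(["T1", "T2", "T3"], (2, 1, 4)))
# prints [('T2', (2, 1)), ('T1', (2,))]
```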

To access our nodes using traces, we must find a storage
structure which maps traces into addresses. We can treat the
storage allocation problem for nodes of type Ti as a mapping
problem:

$$ M_i : N^n \rightarrow N $$

where traces for type Ti are of length n, and N = {1,2,3,4,...}.
Intuitively, we want to map these n-tuples, called traces, onto a
file denoted by N.

The assumption in the above mapping, Mi, is that each
type is mapped onto its own file. Therefore, there will be as
many files as there are nodes in the definition tree. Since many
common user operations are done across all records of a given
type, it is convenient if the records can be accessed
sequentially from a single file. Also, by allocating a unique
file for each type, we obtain an elegant formalization for the
allocation of traces, that is, the mapping of n-tuples onto the
natural numbers.

One obvious way to perform the mapping is to use an
n-dimensional array [Lochovsky 1972]. An element A(x(1),...,x(n))
of the array contains a pointer to the node described by the
trace [x(1),...,x(n)]. However, this scheme requires a lot of
space for pointer elements, many of which will never be present.
We will try to circumvent this problem by using pointerless
allocation schemes.

1.3  Evaluating  an  Allocation  Scheme 

Minimally, every allocation scheme must be able to
perform the mapping Mi reproducibly for every node which is
present in the data base. This is a correctness constraint. In
addition, there are the two obvious efficiency measures: time and
space requirements. The mapping should be "easy" to calculate,
and the amount of wasted memory space should be minimized. There
is also a third factor, namely, how to deal with a data base
whose size varies considerably over time. We continue by
examining each of these problems in more detail.

Our aim is to find a numerical function to implement Mi
without pointers. Since pointer chasing on disk will be rare,
time overhead in disk seeks and extra accesses will not be a
problem. The tradeoff is for some complexity in the address
calculation. However, all the functions proposed here for Mi are
relatively easy to calculate. Since the processor time sacrificed
is negligible compared to the improvement in I/O behaviour, we
will not consider processor time when analysing the costs of
specific schemes.

The second performance measure, memory utilization, is
not so easily dispensed with. It is true that by eliminating
pointers we have cut down considerably on the amount of memory
space required. However, all of the schemes we will discuss
suffer from the problem of fragmentation, the existence of empty
holes in memory. Unfortunately, the difference between
fragmentation loss and pointer space is difficult to measure. The
fraction of space used by pointers is a function of record size,
while fragmentation is independent of record size. Fragmentation
could be a serious problem when record size is large. However, we
will sidestep the problem by pointing out that saving space is
not the primary goal. By eliminating pointers, we no longer have
problems with misdirected links or with long disk searches. We
expect that empirical investigation of the schemes we describe
will show that these improvements compensate for fragmentation
losses.

The third measure, the performance of the scheme on a
growing data base, is also difficult to quantify. Allocation
schemes cover a range from completely static allocation, where a
reorganization over the entire data base is required to expand
the allocation, to completely dynamic schemes where records are
allocated only when nodes are inserted. One way to handle a
growing data base is to statically allocate space, leaving extra
memory to handle all of the new insertions.
This is often a poor solution. When a data base is small and
growing slowly, it is clearly undesirable to commit a lot of
extra space for non-existent records. A better solution is to
allow the data base to grow dynamically, allocating storage only
when needed. However, this is not easy to do. Remember, Mi must
be able to find the address of every present node. We cannot
merely add a node to the end of the file unless Mi can
conveniently reproduce that node's address from its trace. By
eliminating pointers for efficiency reasons, we have created a
difficult logical problem.

Allocation schemes can also be characterized by the
assumptions they make about the shape of the data base tree. A
scheme which assumes an a priori distribution of nodes in the
tree is said to be predictive; it bases its allocation algorithm
on predicted data base behaviour. Storage schemes can also be
adaptive. An adaptive scheme changes its algorithm as the shape
of the data base changes. It attempts to find an efficient
allocation algorithm for the data base as it currently exists.
Dynamic schemes may be predictive or adaptive. Static schemes,
though, are largely predictive in that they assume a specific
size for the data base. Usually, they also assume that the data
base will reach an equilibrium value. If these assumptions are
incorrect, due to a large number of insertions or deletions, then
poor performance can be expected.

1.4 Data Base Behaviour

In analyzing allocation algorithms, we will make use of
a probability distribution which describes the general shape of
the data base tree. Let Fi be a cumulative distribution function
for type i such that:

Fi(x) = the probability that a filial set of type i
has ≤ x nodes.

Fi is a step function with a discrete probability density
function, fi.

The Fi are assumed to be independent across types. This
may not always be true, particularly in the case of the type of a
node's parent. We might infer that a node with a high index
within its filial set has only been recently inserted into the
data base. A new node probably has a small number of children,
implying that the size of the filial set could be dependent on
the parent's filial set index. In a relatively static data base
independence is probably reasonable, since insertions and
deletions should be relatively infrequent, or in any case should
balance each other out. In a dynamic data base dependence of
filial set size on parent may have to be considered in choosing
an allocation scheme.

To find the probability that a given trace exists, we
must first choose a method of assigning indices to nodes within
filial sets. We will adopt the following scheme. When a node is
inserted, it is assigned a trace whose last index is one higher
than that of the youngest member of the filial set of which it is
a member. If it is the first child, it is assigned the index 1.
Thus, we will never leave an "empty" trace in the middle of a
filial set unless a deletion has occurred. A deleted node must be
considered "present", although empty, in order not to upset the
trace addressing scheme. A non-present trace may be assigned
later to a new insertion.
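
A minimal sketch of this index assignment rule follows, assuming an
in-memory list per filial set. The class name FilialSet and the
DELETED marker are hypothetical; the sketch covers only insertion at
the youngest end and deletion that leaves an empty but "present"
slot.

```python
DELETED = object()  # marker for a slot that is "present" but empty


class FilialSet:
    """Members of one filial set, addressed by 1-based index."""

    def __init__(self):
        self._slots = []  # _slots[k] belongs to index k + 1

    def insert(self, record):
        """Store record as the youngest member and return its index."""
        self._slots.append(record)
        return len(self._slots)

    def delete(self, index):
        """Empty the slot but keep it, so that other indices do not shift."""
        self._slots[index - 1] = DELETED

    def get(self, index):
        """Return the record at index, or None if deleted or not assigned."""
        if not 1 <= index <= len(self._slots):
            return None
        record = self._slots[index - 1]
        return None if record is DELETED else record
```

Because a deleted slot is kept, the traces of younger brothers and of
all their descendants remain valid after a deletion.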

To simplify the notation in the remainder of this paper,
we will concentrate on one path of the definition tree. We will
assume that the type of the node T[x(1),...,x(n)] is n and that
each x(i) is an index in a filial set of type i. This allows us
to write traces as vectors [x(1),...,x(n)] without a type prefix
and to take sums and products over sequences of integers rather
than sets of types. This assumption does not limit the
applicability of any of our schemes.

Using the fact that node [x(1),...,x(n)] being present
implies node [x(1),...,x(n)-1] is present, we can now write the
probability of a trace [x(1),...,x(n)] being present.

$$
\begin{aligned}
\mathrm{Prob}([x(1),\ldots,x(n)])
  &= \prod_{i=1}^{n} \mathrm{Prob}(\text{a filial set of type } i \text{ has} \ge x(i) \text{ members}) \\
  &= \prod_{i=1}^{n} \bigl(1 - \mathrm{Prob}(\text{filial set of type } i \text{ has} < x(i) \text{ members})\bigr) \\
  &= \prod_{i=1}^{n} \bigl(1 - F_i(x(i)-1)\bigr) \qquad \text{(eq. 1-1)}
\end{aligned}
$$

If the Fi vary considerably from type to type, eq. 1-1
will be far too complex to analyze for any reasonable n (i.e.,
n>3). However, the functions Fi can be useful, as they give a
concise way of describing the average behaviour of the data base.
We will use them to generate sample data bases for simulations to
be described later. Also, these distributions will provide a nice
way of analyzing the expected amount of fragmentation incurred
with a given allocation scheme.
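
Although eq. 1-1 is hard to analyze in closed form, it is simple to
evaluate numerically. The sketch below assumes each density fi is
supplied as a dictionary from filial set size to probability; the
function names and the sample densities are illustrative only and
are not taken from the measured data.

```python
from math import prod


def make_cdf(density):
    """Return F(x) = Prob(filial set size <= x) for a discrete density.

    density maps a filial set size (0, 1, 2, ...) to its probability.
    """
    def cdf(x):
        return sum(p for size, p in density.items() if size <= x)
    return cdf


def trace_probability(trace, cdfs):
    """Evaluate eq. 1-1: the product over i of (1 - F_i(x(i) - 1))."""
    return prod(1.0 - F(x - 1) for x, F in zip(trace, cdfs))


# Hypothetical densities for a path with two types.
f1 = {1: 0.5, 2: 0.5}
f2 = {0: 0.25, 1: 0.25, 2: 0.5}
cdfs = [make_cdf(f1), make_cdf(f2)]

# Trace [2,1]: (1 - F1(1)) * (1 - F2(0)) = 0.5 * 0.75
print(trace_probability((2, 1), cdfs))  # prints 0.375
```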




1.5  Specific  Schemes 

In the following sections two classes of pointerless
schemes are discussed. In section 2 we examine functions which
hash the traces. Some theoretical observations on hashing traces
are presented and simulation results on more practical hashing
functions are described. In section 3 we examine functions Mi
that are both 1-1 and onto. We look at how an optimal (but
infeasible) Mi would enumerate traces. Several practical
bijective Mi's are presented. A number of schemes using these
functions are described and evaluated.

2. Hashing Traces

A hashing function is a mapping from keys onto addresses
where in general the number of possible key values is much
greater than the number of addresses. The number of keys which
are actually present is close to the number of addresses. The
goal of the hashing function is to distribute the keys evenly
over the address space. If two records which are present have
keys which map onto the same address, then a collision takes
place. Only one record can be stored at the location; the other
is stored at another location whose address is obtained by a
pointer from the first or by an address calculation (e.g., add
some constant to the first address, or rehash the key with a new
function). A later attempt to retrieve one of these two records
may take two accesses to memory (if it was the second one to be
inserted).

The keys we are interested in hashing are traces. A
hashing function for traces corresponds to the mapping Mi of
section 1.3, where the range, N, is bounded. Recall that each
trace type will be mapped onto its own file, so we will limit our
attention to the general case of traces of length n and a file
with A addresses {1,2,...,A}.

A good hashing function will minimize the expected
number of collisions. The performance of a hashing function can
be measured by counting the average number of accesses required
to retrieve a record. The fewer accesses required, the
better. If there are B records present that have equal
probability of being accessed and there are A addresses in
memory, then the best average number of accesses per record we
can expect is max{1,B/A}. Obviously, this "optimal" performance
is unattainable. As pointed out by Knuth [1973, p. 527], the
worst case of maximum number of collisions is so bad that we must
be assured that the average behaviour is quite good. Therefore,
it is average performance that will be the object of our
analysis.

Theoretically, the more the hashing function knows about
which keys are expected to occur, the better it can randomize
these keys over the address space. If we assume that the
distribution function, Fi, is known for each type, then
presumably this information should help us randomize traces. In
fact, knowing the Fi allows us to construct a hashing function
with the property that the expected fraction of the present
traces which hash onto each address in the file is "almost" equal
to 1/A for all addresses. That is, if B records are present, for
any address d (1 ≤ d ≤ A) the expected number of traces which map
onto d is B/A. This hashing function works by using the functions
Fi to recursively partition the interval [1,A] into subintervals,
so that the length of each subinterval is directly proportional
to the expected number of traces which hash onto it. A detailed
description of the algorithm is found in Appendix I.

Although the even spread guaranteed by this hashing
function is probably the best we can hope for, it is not
necessarily optimal with respect to the number of collisions. The
probability of a collision is the probability that two or more
nodes mapping to the same address are both present. The expected
number of collisions is a complicated function of the Fi and the
hashing function. We are currently investigating how far
even spread is from the minimum average access time.


Even the hashing algorithm of Appendix I lacks practical
significance, however. First, it is slow. For reasonably long
traces, the address transformation requires quite a bit of
arithmetic. A more fundamental problem, common to all hashing
functions, is that the Fi are generally not known in advance. A
practical hashing function should assume very little about the
exact distribution of the keys, since such information is rarely
available and is continually changing. Hence, we must look
elsewhere for a practical method of hashing traces.

2.1 Experimental Results

There have been some empirical results on practical
hashing functions. For example, Lum, Yuen and Dodd [1971]
experimented on real files with a variety of hashing functions.
In their experiments, they varied the ratio of number of records
present to memory size (called the loading factor) and determined
the average number of accesses per record for eight different
functions. They found that overall the division algorithm, which
divides the key by the range size and uses the remainder as the
address, performed best under a wide range of conditions. Knuth
[1973] concurs in citing the division algorithm as one of the
best of those currently known. Therefore, in looking for an
efficient way to hash traces, we settled on the division
algorithm as our prime candidate. We then ran a series of
experiments to test its behaviour.

In all of the practical schemes to be discussed, we
associate with each type, i, a non-negative integer, max(i),
which denotes the upper bound for the size of any filial set
containing type i nodes. Although the max(i) are not essential to
any of the allocation schemes discussed in this paper, they make
various calculations of bounds a little more convenient. Many
commercial tree structured data base systems require the user to
define the max(i).

Since the trace must be transformed into an integer
before the division algorithm can be applied, a method for
mapping traces into integers was needed. In our experiments we
tried three methods:

1. multiply the trace indices together;

2. concatenate the trace indices together;

3. consider the trace to be a number in the mixed base
[max(1),max(2),...,max(n)] and map it into a base 10
number.

In each case, the resulting integer was divided by the number of
slots in memory and the remainder was taken to be the address.
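
The three trace-to-integer mappings and the division step can be
sketched as below. Several details are assumptions, since the text
does not fix them: concatenation is taken to mean joining the decimal
digits of the indices, the mixed base method uses x(i)-1 as the digit
so that every digit is smaller than its base max(i), and the
remainder is used directly as a 0-based address. All function names
are hypothetical.

```python
from math import prod


def key_by_product(trace):
    """Method 1: multiply the trace indices together."""
    return prod(trace)


def key_by_concatenation(trace):
    """Method 2: concatenate the indices (assumed: as decimal digits)."""
    return int("".join(str(x) for x in trace))


def key_by_mixed_base(trace, max_sizes):
    """Method 3: read the trace as a number in the mixed base max(1..n).

    Assumption: the digit for position i is x(i) - 1, so that it lies
    in the range 0 .. max(i) - 1.
    """
    value = 0
    for x, m in zip(trace, max_sizes):
        if not 1 <= x <= m:
            raise ValueError("index outside 1..max(i)")
        value = value * m + (x - 1)
    return value


def division_hash(key, slots):
    """Division algorithm: the remainder is the (0-based) address."""
    return key % slots


trace, max_sizes, slots = (2, 3, 1), (4, 5, 3), 10
print(division_hash(key_by_product(trace), slots))                # 6 % 10
print(division_hash(key_by_concatenation(trace), slots))          # 231 % 10
print(division_hash(key_by_mixed_base(trace, max_sizes), slots))  # 21 % 10
```

The product method maps every permutation of a trace to the same key,
for example [1,2] and [2,1], which is one reason it can be expected to
collide often when indices are small.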

To choose sample sets of traces on which to run our
hashing experiments, we obtained data on the distribution of
filial set size from a local IBM/IMS installation. These results
are detailed in Appendix II. Filial set sizes in this data base
are small. The shapes of the filial set size distributions do not
follow any classical pattern, such as exponential or uniform
distribution. For x>0, fi(x) usually follows a triangular
distribution (e.g., fig. 2-1). fi(0) is an anomaly in nearly all
cases. These empirical distributions led us to model the fi using
triangular density functions.

We tested each of the three hashing functions on traces
of length four and six. Each sample set of traces was constructed
by growing a tree in a depth first manner by randomly selecting
filial set sizes from the fi. Filial set sizes were distributed
triangularly with ranges (i.e., [a,b] in fig. 2-1) of [1,3],
[1,7] and [1,9]. The leaves of each tree were then hashed at a
variety of loading factors, and the average number of accesses
per node was determined.
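
An experiment of this kind can be sketched as follows. This is not
the original program: the symmetric discrete triangular weights, the
use of decimal concatenation as the key, and linear probing as the
collision rule are all assumptions made so that the example is
self-contained, and the function names are hypothetical.

```python
import random


def triangular_weights(a, b):
    """Assumed symmetric discrete triangular weights for sizes a..b."""
    return [min(k - a + 1, b - k + 1) for k in range(a, b + 1)]


def grow_leaf_traces(depth, a, b, rng):
    """Grow a tree depth first and return the traces of its leaves."""
    sizes = list(range(a, b + 1))
    weights = triangular_weights(a, b)
    leaves = []

    def grow(prefix):
        if len(prefix) == depth:
            leaves.append(tuple(prefix))
            return
        size = rng.choices(sizes, weights)[0]  # size of this filial set
        for index in range(1, size + 1):
            grow(prefix + [index])

    grow([])
    return leaves


def average_accesses(traces, slots):
    """Hash by concatenation and division; resolve by linear probing.

    With no deletions, the probes needed to insert a record equal the
    accesses later needed to retrieve it.
    """
    if len(traces) > slots:
        raise ValueError("linear probing needs at least one slot per record")
    table = [None] * slots
    total = 0
    for trace in traces:
        address = int("".join(map(str, trace))) % slots
        probes = 1
        while table[address] is not None:
            address = (address + 1) % slots
            probes += 1
        table[address] = trace
        total += probes
    return total / len(traces)


rng = random.Random(1)
leaves = grow_leaf_traces(depth=4, a=1, b=3, rng=rng)
for loading_factor in (0.5, 0.7, 0.9):
    slots = int(len(leaves) / loading_factor) + 1
    print(loading_factor, round(average_accesses(leaves, slots), 2))
```

Repeating the run over many seeds and ranges would give the spread of
values at each loading factor.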

The results are summarized in figure 2-2. Each curve
represents the average of about 75 data points in the entire
range of loading factors. There was a considerable range of
values for the average number of accesses per node at each
loading factor, thus shedding some doubt on the confidence level
of the results. Since we are still very uncertain as to how
typical our sample data base was, no elaborate statistical
analysis was attempted.

Our goal was to get an idea of what kinds of
trace-to-integer mappings would yield good results when used in
conjunction with the division algorithm. As expected, the
multiplication algorithm was collision prone when used on small
filial set sizes. Large filial set sizes should improve that
method's performance somewhat, but probably not to a reasonable
level. The mixed base and concatenation methods performed well,
comparing quite favourably to the best hashing functions tested
by Lum, Yuen, & Dodd [1971]. Although the amount by which
concatenation outperformed mixed base may possibly not be
significant, the computational simplicity of the concatenation
method, and its independence of the max(i), tend to recommend it
anyway.

The results of this experiment are far from conclusive
for several reasons. First, the trees we hashed were relatively
small. Second, we have only seen filial set size distributions
for a single data base. Getting these distributions for a large
data base is expensive; it cost us an hour of IBM 370/158 CPU
time (twelve hours core resident as a background job) to do a
depth first search of a data base with about 1.8 million nodes.
To obtain reasonable certainty of typical filial set size
distributions, many such data bases would have to be examined. In
conclusion, although the evidence does suggest that some simple
hashing functions for traces work well, more extensive tests on
existing data bases will have to be tried before the methods can
be used with some confidence.

Figure 2-1 A Triangular Distribution for Filial Set Size

Figure 2-2a Hashing function performance on length three traces

2.2 Dynamic Properties

The results we obtained for the average number of
accesses per record indicate that effective hashing functions do
exist for traces. However, the technique of hashing does have an
important limiting factor. Before any records can be stored, one
must choose a file size, i.e., the number of addresses in the
range. If the expected number of present records was
underestimated, the loading factor may climb quite high, perhaps
even past one. When this occurs, drastic action is usually
required.

The  only  feasible  solution  to  reduce  the  accesses  per 
record  ratio  is  to  extend  the  file.  But  now  the  hashing  algorithm 
will  produce  incorrect  addresses  for  these  records  already 
present,  so  all  the  present  records  must  be  rehashed  and 
relocated  [Bays  1972].  Morris  [1968]  points  out  that  rehashing 
can  be  avoided  by  using  virtual  hash  addresses.  The  virtual  range 
and  the  real  range  are  both  integer  powers  of  two,  say  2*^V  and 
2^=<=R  respectively,  with  V>S.  Now,  the  S  lowest  order  bits  of  the 
virtual  hash  are  used  to  find  the  real  address.  Memory  size  can 
be  extended  at  any  time  to  B'>B;  although  relocation  is  still 
necessary,  rehashing  is  not  since  the  R’  lew-  order  bits  of  the 
virtual  hash  can  be  used  for  the  address.  The  method  also  works 
for  compressing  the  fils  when  the  loading  factor  is  very  small. 
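The virtual hash idea can be sketched in a few lines. This is only an illustration under stated assumptions: the hash function `virtual_hash`, the helper names, and the use of chained buckets held in a dictionary are all hypothetical and are not taken from Morris [1968]. The point shown is that each record keeps its V-bit virtual hash, so resizing the file only re-masks stored values and never recomputes a hash from the trace.

```python
def virtual_hash(trace, v_bits=32):
    """Hypothetical V-bit virtual hash of a trace; any V-bit hash would do."""
    h = 0
    for index in trace:
        h = (h * 1_000_003 + index) % (1 << v_bits)
    return h


def real_address(vhash, r_bits):
    """Real address = the R low-order bits of the virtual hash."""
    return vhash & ((1 << r_bits) - 1)


def store(table, trace, r_bits):
    """Store the virtual hash with the record so it is never recomputed."""
    vhash = virtual_hash(trace)
    table.setdefault(real_address(vhash, r_bits), []).append((vhash, trace))


def resize(table, new_r_bits):
    """Relocate all records to a file of 2**new_r_bits addresses.

    Works for growing (R' > R) and for compressing (R' < R); only the
    stored virtual hashes are consulted, so no rehashing takes place.
    """
    resized = {}
    for bucket in table.values():
        for vhash, trace in bucket:
            resized.setdefault(real_address(vhash, new_r_bits), []).append((vhash, trace))
    return resized
```

Relocation still touches every record, which is the cost the following paragraph objects to.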

Still, relocating a file full of records is not very appealing. If
the data base is volatile, frequent relocation is intolerably
expensive. Hence, a hashing function cannot be used as a
trace-to-address transformation under dynamic conditions, although
it may be useful for overflow management in a more dynamic scheme.
For some data bases, file sizes are static; indeed, even within a
volatile data base some types may be static. In these cases,
hashing should yield efficient results with a minimal amount of
reorganization required.

3. Bijective Trace-to-address Maps

To help us with the problem of growing data bases, we now focus our
attention on Mi mappings which are both one-to-one and onto (i.e.,
bijective). A bijective Mi linearly orders the traces according to
their images under Mi. If Mi enumerates the traces in approximately
the same order that the traces are inserted into the data base,
then the data base can conveniently grow merely by extending the
existing file as new nodes are inserted. When a trace appears whose
address is beyond the current extent of the file, then either the
file can be extended or the trace can be hashed into an overflow
area. A crowded overflow area is handled by incrementing the
current upper bound for the file (called the cutpoint) to absorb
those overflow records whose addresses are only slightly larger
than the old cutpoint. The success of this type of cutpoint scheme
depends primarily on finding a bijective Mi which orders traces in
a "nice" way and developing an algorithmic way of choosing a
cutpoint that limits wasted memory space and avoids overcrowding in
the overflow area.

3.1 Pairing Functions

A pairing function is a bijective mapping, M,

M: N x N -> N

where N = {1, 2, 3, ...}. To map a trace [x(1),...,x(n)] into an
address, the pairing function M can be applied recursively to
elements of the trace so that

Mi([x(1),...,x(k)]) = M((...M(M(M(x(1),x(2)),x(3)),x(4))...),x(k)).

Using different pairing functions for different types gives us some
control over the order in which Mi enumerates traces. Some examples
are appropriate to give the reader some intuition on how pairing
functions are constructed.

The best known type of pairing function is the simple array
calculation (see figure 3-1a). Columns (or rows) of the array are
sequentially allocated so that finding an address for a pair [x,y]
involves skipping the first x-1 columns and choosing the y-th
element of the x-th column. Another pairing function could
enumerate by diagonal elements (figure 3-1b). In this mapping newly
allocated x's have fewer associated y's than older allocated x's.
Figure 3-1c illustrates a diagonal mapping with a bound on values
for y.

Figure 3-1 Examples of Pairing Functions
(The arrows indicate the layout of the x,y elements on sequential
memory.)

(a) Bounded rectangular pairing function (i.e., array calculation):
    M(x,y) = (x-1)·Y + y
(b) Unbounded diagonal pairing function (i.e., "high numbered
    parents have few children")
(c) Bounded diagonal pairing function (only the second case of the
    formula is legible):
    else M(x,y) = (x+y-3/2)·Y - Y^2/2 + y

The number of conceivable pairing functions is clearly
limitless. However, not all pairing functions make sense for
enumerating traces. An admissible pairing function for traces must
enumerate [x,y] before [x,y+1], since [x,y+1] is never inserted
before [x,y]. In addition, the pairing function's behaviour should
correspond to our observations of how filial sets grow. In this
respect there are basically two viewpoints we can take: either
filial sets all have the same size (i.e., rectangular pairing
function) or filial sets whose parents have higher valued indices
are smaller than those whose parents have lower valued indices
(i.e., diagonal pairing function). This is a gross, but necessary,
simplification over the generalized distribution functions, Fi. The
functions Fi describe filial set size, but do not distinguish among
children sets of specific parents. That is, if a node of type Tx
has children of type Ty, the actual trace of a Tx node gives no
hint as to how big its Ty filial set is. This has the intuitive
justification that the index of a node within its filial set should
give no information as to how many children it will have. In
addition, rectangular pairing functions allocate filial sets
contiguously, yielding nice buffering characteristics when an
entire filial set must be retrieved. This line of argument suggests
a rectangular pairing function. However, nodes which have recently
been inserted generally have a smaller number of children than
those which have been present for a while, suggesting a diagonal
pairing function. We will have more to say on the subject of which
pairing function is more appropriate. Our present point, though, is
that the choice will not be affected much by the Fi for the type
which is being enumerated.
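The two families of pairing functions discussed in this section can be sketched as follows. The rectangular formula is the one given in figure 3-1a. The unbounded diagonal formula is not legible in the figure, so the version below is an assumption: it is one standard diagonal enumeration, chosen so that [x,y] always precedes [x,y+1]. The function names are hypothetical, and `trace_address` uses a single pairing function for every level, whereas the text allows a different one per type.

```python
def rectangular(x, y, width):
    """Figure 3-1a: M(x,y) = (x-1)*Y + y, defined for 1 <= y <= Y."""
    if not 1 <= y <= width:
        raise ValueError("y lies outside the bound Y")
    return (x - 1) * width + y


def diagonal(x, y):
    """Unbounded diagonal enumeration (assumed formula, see lead-in).

    The pair (x, y) lies on diagonal d = x + y - 1; diagonals are laid
    out one after another and ordered by y within a diagonal.
    """
    d = x + y - 1
    return d * (d - 1) // 2 + y


def trace_address(trace, pair):
    """Apply a pairing function from the left: M(...M(M(x1,x2),x3)...,xk)."""
    address = trace[0]
    for index in trace[1:]:
        address = pair(address, index)
    return address
```

For example, with Y = 4 the trace [2,3,1] maps to rectangular(rectangular(2,3,4),1,4).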

3.2 Optimal Bijective Mappings

Despite the intuition which tells us not to consider the filial set
distributions when choosing a pairing function, there are some
results which describe the best possible mappings when the data
base behaviour is known. To evaluate the behaviour of the bijective
address mappings, we define the measure q(t):

$$ q(t) = \frac{\mathrm{present}(t)}{\mathrm{allocated}(t)} $$

where allocated(t) is the number of records allocated at time t and
present(t) is the number of nodes present at time t. Clearly, q(t)
should be close to one for good performance. In addition, we will
adopt the allocation rule that all nodes which are present must be
allocated (i.e., they cannot be handled as overflows). Thus, if x
is the type i node whose image under Mi is the maximum of all type
i nodes which are present at time t, then allocated(t) = Mi(x).
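As a small illustration of the measure, the sketch below tracks q(t) for a sequence of insertions under the allocation rule just stated. It assumes one insertion per time step and no deletions; the function name `utilization` and the argument `address_of` (standing in for Mi) are hypothetical.

```python
def utilization(insertion_order, address_of):
    """Return q(t) = present(t) / allocated(t) after each insertion.

    allocated(t) is the highest address reached so far, since every
    present node must lie inside the allocated part of the file.
    """
    highest = 0
    history = []
    for t, trace in enumerate(insertion_order, start=1):
        highest = max(highest, address_of(trace))
        history.append(t / highest)
    return history
```

With a rectangular mapping of width 3, inserting [2,1] before [1,3] leaves address 3 empty for one step, so q(t) dips to 3/4 and then recovers.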

Let TRi be a vector of traces which lists the type i nodes in the
order that they are inserted into the data base. For simplicity, we
assume that the j-th trace in TRi is inserted at time j. For a
given TRi, the bijective Mi that maps TRi(j) into address j for all
j is optimal with respect to q(t), since q(t)=1 for all t.
Similarly, given Mi, the vector of traces which behaves best under
Mi with respect to q(t) is TRi such that Mi(TRi(j))=j for all j.
Neither of these propositions considers deletions.

These observations are not very useful, since future insertions
into the data base are not predictable. A somewhat more realistic
assumption is that we know Fi(x,t), that is, the distribution Fi(x)
at time t. Fi(x,t) can be obtained by recording a histogram of
filial set sizes for each type as nodes are inserted into the tree.
Given Fi(x,t), we can obtain p(x(j),t), i.e., the probability that
a trace x(j) is present at time t. (The subscript i for p is
excluded to simplify notation.) We can then show that any
admissible Mi which maximizes the expected value of q(t) at time t
enumerates the traces according to decreasing p(x(j),t).


Lemma  Let x(1), x(2), ..., x(m) be an admissible ordering of m
different traces of type n. If i, 1 <= i <= m-1, is an index such
that p(x(i),t) < p(x(i+1),t), then x(1), x(2), ..., x(i+1), x(i),
..., x(m) has a larger value for the expected value of q(t) than
x(1), x(2), ..., x(i), x(i+1), ..., x(m).

Proof

$$ q(t) = \frac{\mathrm{present}(t)}{\mathrm{allocated}(t)} $$

$$ E(q(t)) = \frac{E(\mathrm{present}(t))}{E(\mathrm{allocated}(t))} $$

(note: E(x) means "the expected value of x")

From the definition of expected value, we calculate

$$ E(\mathrm{present}(t)) = \sum_{j=1}^{m} p(x(j),t) $$

which is independent of the ordering of the x(j).

Let E1 = E(allocated(t)) when x(i) precedes x(i+1)
    E2 = E(allocated(t)) when x(i+1) precedes x(i)

We want to show E2 < E1 to prove the lemma.

$$ E1 = \sum_{j=1}^{m} \bigl[\, j \cdot p(x(j),t) \cdot \mathrm{Prob}(\forall k>j,\ x(k)\ \text{is not present}) \,\bigr] $$

$$
\begin{aligned}
E2 ={}& \sum_{j=1}^{i-1} \bigl[\, j \cdot p(x(j),t) \cdot \mathrm{Prob}(\forall k>j,\ x(k)\ \text{not present}) \,\bigr] \\
 &+ i \cdot p(x(i+1),t) \cdot \mathrm{Prob}(x(i)\ \text{not present and}\ \forall k>i+1,\ x(k)\ \text{not present}) \\
 &+ (i+1) \cdot p(x(i),t) \cdot \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present}) \\
 &+ \sum_{j=i+2}^{m} \bigl[\, j \cdot p(x(j),t) \cdot \mathrm{Prob}(\forall k>j,\ x(k)\ \text{not present}) \,\bigr]
\end{aligned}
$$

Nodes x(1),...,x(i-1) and x(i+2),...,x(m) contribute the same total
value to both E1 and E2. The only change is due to the
contributions of x(i) and x(i+1). The difference between the
x(i+1), x(i) ordering and the x(i), x(i+1) ordering is

$$
\begin{aligned}
E2 - E1 ={}& i \cdot p(x(i+1),t) \cdot (1 - p(x(i),t)) \cdot \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present}) \\
 &+ (i+1) \cdot p(x(i),t) \cdot \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present}) \\
 &- i \cdot p(x(i),t) \cdot (1 - p(x(i+1),t)) \cdot \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present}) \\
 &- (i+1) \cdot p(x(i+1),t) \cdot \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present})
\end{aligned}
$$

Multiplying and collecting terms we obtain

$$ E2 - E1 = \mathrm{Prob}(\forall k>i+1,\ x(k)\ \text{not present}) \cdot [\, p(x(i),t) - p(x(i+1),t) \,] $$

But p(x(i),t) < p(x(i+1),t) by hypothesis. Thus, E2-E1 < 0,
implying that the x(i+1), x(i) ordering increases the value of
E(q(t)).


Theorem  For a given t, the maximum value for E(q(t)) is obtained
by mapping the nodes onto N such that if p(x(k),t) < p(x(k+1),t),
then Mi(x(k)) > Mi(x(k+1)).

Proof  Assume that Mi(x(k)) < Mi(x(k+1)) and p(x(k),t) <
p(x(k+1),t). Then, by the previous lemma, we can switch x(k) and
x(k+1) to yield an improvement in E(q(t)), which is a
contradiction.
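The lemma and theorem can be checked numerically on a small example. The sketch below assumes, as the proof does implicitly when it writes factors of the form (1 - p), that traces are present independently of one another; the function name `expected_allocated` is hypothetical. It evaluates the sum used for E1 and E2 above, so an exhaustive search over orderings can confirm that decreasing probability gives the smallest expected allocation.

```python
from itertools import permutations


def expected_allocated(probs):
    """E(allocated(t)) for traces laid out in the given order.

    probs[j-1] is the presence probability of the trace at address j.
    Address j is the highest occupied one with probability
    p_j * prod_{k>j} (1 - p_k), assuming independent presence.
    """
    total = 0.0
    none_above = 1.0  # probability that no trace above address j is present
    for j in range(len(probs), 0, -1):
        total += j * probs[j - 1] * none_above
        none_above *= 1.0 - probs[j - 1]
    return total


def best_ordering(probs):
    """Exhaustively find the ordering with the smallest expected allocation."""
    return list(min(permutations(probs), key=expected_allocated))
```

Since E(present(t)) does not depend on the ordering, the ordering that minimizes E(allocated(t)) is the one that maximizes the ratio used for E(q(t)) in the proof.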

While the theorem does guarantee that the bijective Mi which orders
according to decreasing p(x,t) is optimal at time t, it does not
say that an Mi which is optimal for all t is possible. To find such
an optimal Mi, strong restrictions must be placed on the behaviour
of p(x,t) over time. A set of nodes may be ordered by decreasing
probability at time t, but at time t+1 some nodes may change
probability sufficiently to alter the ordering. Only if the
ordering of nodes by probability is the same over time will Mi be
optimal with respect to E(q(t)) for all t. However, even if this is
not the case, intuition about the expected probability over time
can be used to obtain an approximately decreasing sequence of
probabilities. Providing there are no gross anomalies (i.e., a high
probability node mapped onto a high address), E(q(t)) should be
near its maximum value most of the time.

3.3 Cutpoint Schemes

Using the results of the previous sections, we would like to find a
bijective mapping, Mi, that enumerates the traces in approximately
decreasing probability. The algorithm for the overall scheme, given
Mi, is

1. Choose a cutpoint, ci, which bounds the primary storage area.
   If a trace, x, is such that Mi(x) <= ci, then the trace is
   stored in the primary area. Otherwise, it is hashed into an
   overflow area by one of the methods described in section 2.

2. As the data base fills, ci is increased. Every time ci is
   changed to ci', the overflow area is searched for those x with
   Mi(x) <= ci', which should be moved into primary storage.

There are two criteria for choosing the cutpoint: the cutpoint
should be low enough that the primary area has relatively high
memory utilization and high enough that the overflow area is small
to facilitate quick reorganization (i.e., step 2 above). These are
clearly conflicting goals. A good intermediate value for ci must be
found that limits the amount of wasted storage space without
forcing frequent execution of the costly reorganization.
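The two steps of the scheme can be sketched as a small class. This is a minimal illustration, not the paper's implementation: the class and method names are hypothetical, the primary area is modelled as a dictionary keyed by address, and a plain dictionary keyed by trace stands in for the hashed overflow area of section 2.

```python
class CutpointFile:
    """Primary area for addresses 1..cutpoint, plus an overflow area."""

    def __init__(self, address_of, cutpoint):
        self.address_of = address_of  # the bijective mapping Mi
        self.cutpoint = cutpoint      # ci
        self.primary = {}             # address -> record
        self.overflow = {}            # trace -> record (stand-in for hashing)

    def insert(self, trace, record):
        """Step 1: store in the primary area if Mi(x) <= ci, else overflow."""
        address = self.address_of(trace)
        if address <= self.cutpoint:
            self.primary[address] = record
        else:
            self.overflow[trace] = record

    def raise_cutpoint(self, new_cutpoint):
        """Step 2: scan the overflow area and move records now below ci'."""
        self.cutpoint = new_cutpoint
        for trace in list(self.overflow):
            address = self.address_of(trace)
            if address <= new_cutpoint:
                self.primary[address] = self.overflow.pop(trace)

    def lookup(self, trace):
        address = self.address_of(trace)
        if address <= self.cutpoint:
            return self.primary.get(address)
        return self.overflow.get(trace)
```

The scan in `raise_cutpoint` visits every overflow record, which is why a small overflow area makes reorganization quick.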

We want a method of finding a good value for ci. There are three
basic costs associated with this scheme: storage space, access
time, and reorganization time. Storage cost is simply the amount of
space consumed in the primary and overflow areas. Access time
includes the amount of time required to access both primary and
overflow records. Since all address calculation is done
numerically, the difference in access time between primary and
overflow records will be insignificant. Therefore, average access
time will not be affected by the choice of cutpoint and can be
disregarded in the analysis. Reorganization costs are more complex
and will be treated separately.

The reorganization cost is a function of the cost for a single
reorganization and the frequency with which reorganizations must be
done. When the overflow area fills up, the system can take one of
two possible actions: increase the cutpoint and/or increase the
size of the overflow area. Both actions can be accomplished in time
proportional to the size of the overflow area. To increase the
cutpoint, each overflow record is examined; if it maps onto an
address less than the new cutpoint, then it is moved into the
primary area. Bays [1972] has an algorithm to extend hash tables by
a sequential pass over the table. This procedure can be applied
concurrently with cutpoint reorganization to increase the size of
the overflow area.

How frequently reorganization is necessary depends on the insertion
rate into the data base (i.e., how many nodes are inserted per unit
time). If K percent of the insertions go into the overflow area,
then the frequency of reorganization is K/100 times the insertion
rate divided by the amount of empty space in the overflow area.

To find the best cutpoint, we begin by defining the following
variables:

NP    number of nodes present in the entire data base
NAP   number of nodes allocated in the primary area (i.e., NAP=ci
      in the above scheme)
OV    size of the overflow area
POV   number of present nodes in the overflow area
EOV   number of empty spaces in overflow area (i.e., EOV = OV - POV)
LF    maximum allowable loading factor in overflow area
IR    insertion rate
S     storage cost
R     reorganization cost
TOT   total cost

Assuming negligible cost due to access time, the total cost for a
given cutpoint scheme is

TOT = S + R                                   (eq. 3-1)
S = K1 · (NAP + OV)
R = K2 · OV · (IR · POV · LF) / (NP · EOV)

POV/NP is the fraction of the insertion rate which contributes
nodes to the overflow area. (LF · POV · IR)/(NP · EOV) is the
reorganization frequency of the overflow area. K1 and K2 are
constant costs associated with space and time respectively. The
equation for TOT is what we must optimize, with NAP and OV being
the free variables. We now turn our attention to a specific
cutpoint scheme which uses pairing functions.
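Equation 3-1 is simple enough to evaluate directly, which is convenient when comparing candidate cutpoints by hand. The function below is a direct transcription of the three formulas; the function name, the argument names, and the numbers in the usage note are illustrative assumptions only.

```python
def total_cost(nap, ov, pov, np_, ir, lf, k1, k2):
    """TOT = S + R from eq. 3-1.

    nap: nodes allocated in the primary area (the cutpoint ci)
    ov:  size of the overflow area; pov: present nodes in it
    np_: nodes present in the entire data base
    ir:  insertion rate; lf: maximum loading factor of the overflow area
    k1, k2: cost constants for space and time
    """
    eov = ov - pov  # empty spaces in the overflow area
    if eov <= 0:
        raise ValueError("overflow area is full; EOV must be positive")
    storage = k1 * (nap + ov)
    reorganization = k2 * ov * (ir * pov * lf) / (np_ * eov)
    return storage + reorganization
```

For instance, with NAP=1000, OV=200, POV=100, NP=1000, IR=50, LF=0.8, K1=1 and K2=10, the storage term is 1200 and the reorganization term is 80.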

Assume once again that the density functions, fi, are known. Recall
the discussion at the end of section 3.1 which suggested a
rectangular or diagonal pairing function as the most appropriate in
implementing a bijective Mi. In the following analysis we will use
a rectangular scheme. Our choice is based largely on its convenient
analytic properties. It turns out that the results obtained are
applicable to a bounded diagonal pairing function (e.g., figure
3-1c) with only minor changes. Whether the intuition which favours
bounded diagonal pairing functions over rectangular ones is
justifiable can only be determined by experimentation.

Let d(i) be the width of pairing function allocation for
nodes of type i (i.e., Y=d(i) in figure 3-1a). Pi will denote the
pairing function for type i, the Pi's differing only in their
constants, d(i). A trace [x(1),...,x(n)] is mapped into an address,
Mn([x(1),...,x(n)]), by applying the pairing functions from the
left as follows:

Pn((...P4(P3(P2(x(1),x(2)),x(3)),x(4)),...),x(n))

If any x(i) is greater than its corresponding d(i), then the trace
does not map into the primary area and must be handled as an
overflow. If d(i)=max(i) for all i, then all the nodes will map
into the primary area. As d(i) decreases, more nodes become
overflows; at the same time a lower fraction of the primary area is
empty, since the high values of x(i) are the most sparsely
populated. Hence, choosing a cutpoint in this scheme reduces to
choosing values for each d(i).
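The left-to-right application of rectangular pairing functions with per-type widths can be sketched as below. The function name `primary_address` and the convention of returning `None` for an overflow trace are assumptions made for this illustration; `widths[i-1]` plays the role of d(i).

```python
def primary_address(trace, widths):
    """Map [x(1),...,x(n)] to Pn(...P3(P2(x(1),x(2)),x(3))...,x(n)).

    Each Pi is the rectangular function Pi(a, y) = (a-1)*d(i) + y.
    Returns None when some x(i) exceeds d(i), i.e. when the trace
    must be handled as an overflow.
    """
    if any(x > d for x, d in zip(trace, widths)):
        return None
    address = trace[0]
    for x, d in zip(trace[1:], widths[1:]):
        address = (address - 1) * d + x
    return address
```

With widths [2, 3, 2] the primary area holds 2·3·2 = 12 records, and the trace [2,3,2] maps to the last of them.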

Assume we are allocating files for types T(1), T(2), ..., T(n),
where T(i) is the parent of T(i+1) in the definition tree. We
allocate for T(1) exactly the number of nodes which are present,
obviously suffering no fragmentation or overflow. Set d(1) to be
the number of present nodes of type T(1). Given d(1), we allocate
d(1)·d(2) nodes for T(2). We want to choose d(2) to minimize eq.
3-1 (our choice of d(1) is optimal). Similarly, for type T(3) we
allocate d(1)·d(2)·d(3) nodes, choosing d(2) and d(3) to minimize
eq. 3-1. The general allocation problem for type T(i) is to
allocate

$$ \prod_{j=1}^{i} d(j) $$

nodes, choosing each of the d(j) so that eq. 3-1 is minimized. By
the following lemma, we show that the optimal d(i) values are the
same for the allocation of all types with the same ancestors.

Lemma  If d(j)=s(j) for 1 <= j <= i-1, where each s(j) is a known
positive integer constant, minimizes eq. 3-1 for type i-1, then
those values for d(j) minimize eq. 3-1 for type i.

Proof

The proof uses the fact that eq. 3-1 is monotonically increasing
with respect to increasing the length of the trace. A detailed
proof can be found in Appendix III.


This lemma simplifies the optimization of the d(i)'s considerably.
It means that if types T(1), T(2), ..., T(i-1) are allocated
optimally using d(1), ..., d(i-1), then those same values for d(1),
..., d(i-1) are best for T(i), reducing the optimization to the
single variable d(i).

Although it is possible for a numerical procedure to choose a
cutpoint and an overflow area size to minimize the total cost (eq.
3-1), this type of "automatic" optimization is probably not
desirable. First, the factors K1 and K2 are hard to quantify;
comparing time to space is never easy. Second, there are economic
factors which are not considered in the numerical optimization. For
example, a system may have an extra on-line disk drive available.
The data base administrator might choose to extend the primary area
to the point where fragmentation is excessive, simply because it
will postpone a reorganization for a long time. Given some
environmental change (e.g., a new inventory shipment), a manager
may be able to predict a drastic change to data base statistics,
such as insertion rate or filial set distribution. Such
considerations make a computer assisted optimization procedure more
desirable than a fully automated one. The system can easily
calculate values for expected amount of fragmentation and overflow
area size, given a new cutpoint. System predictions about
reorganization frequency and insertion rate might also be useful.
Given these kinds of tools, a data base administrator should be
able to make an intelligent choice of cutpoint.

Most of the calculations for choosing d(i) are nearly the same for
a bounded diagonal pairing function as the rectangular one. The
diagonal method should have the same probabilistic behaviour as a
rectangular scheme, so the only change is in the implementation of
the address mapping itself. It should be obvious in the discussion
of implementation considerations that the diagonal scheme is no
more complex to implement than the rectangular one.

3.4 Practical Implementation of Cutpoints

There are serious problems with the practicality of the above
scheme for computing the d(i)'s. First, the density functions, fi,
are generally not known in advance. Worse still, the fi change with
time, which was the entire reason for using a cutpoint scheme,
i.e., a dynamic data base. A second problem relates to the
addressing mechanism. Although it is easy to construct a
rectangular pairing function for a given d(i), it is not as easy to
modify the pairing function so that when d(i) is increased the
originally allocated nodes still map onto the same addresses. Both
problems, it turns out, are solvable.

To bypass the fi's, we make the scheme adaptive. The data base
system maintains a histogram of filial set sizes for each type. The
histogram can be normalized with respect to the number of nodes
present of the type, and the result should be a good approximation
to the real density function. The adaptive scheme then works
something like this. For each i, d(i) is set equal to some initial
value. As nodes are inserted into the data base, the histograms are
updated. When a type has had more than K insertions, where K is
some suitable constant, the histogram is normalized into a density
function and the calculation for the optimal d(i) is made. If d(i)
should be increased, then type i and all higher numbered types have
their allocations extended to reflect the increase. Overflow areas
are checked and nodes which can now fit into the primary area are
moved. If d(i) is not changed, then of course all the allocations
remain the same. The data base thus grows only as the insertions
warrant it.
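The bookkeeping behind the adaptive scheme can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: the class and function names are hypothetical, the histogram is normalized over the filial sets observed so far (parents with no children are not counted), and `fraction_within` is only a helper for judging a candidate width, not the optimization of eq. 3-1.

```python
from collections import Counter


class FilialHistogram:
    """Tracks filial set sizes for one type as nodes are inserted."""

    def __init__(self):
        self.children = Counter()  # parent trace -> current filial set size

    def record_insertion(self, trace):
        """A node's parent is identified by its trace minus the last index."""
        self.children[tuple(trace[:-1])] += 1

    def density(self):
        """Normalize the histogram of set sizes into an estimated density."""
        sizes = Counter(self.children.values())
        total = sum(sizes.values())
        return {size: count / total for size, count in sorted(sizes.items())}


def fraction_within(density, width):
    """Expected fraction of nodes whose last index is at most `width`."""
    covered = sum(p * min(size, width) for size, p in density.items())
    total = sum(p * size for size, p in density.items())
    return covered / total
```

A system could report `fraction_within` for several candidate values of d(i) to support the computer assisted choice described in section 3.3.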

When a d(i) value is incremented to d'(i), then type i nodes with
traces [x(1),...,x(i)], where d(i) < x(i) <= d'(i), must be
addressed differently from those with x(i) <= d(i) (see figure
3-2). Let N(i-1) be the number of nodes of type i-1 allocated when
d(i) was increased to d'(i). Then a node [x(1),...,x(i)], where
x(j) <= d(j) for 1 <= j <= i-1 and d(i) < x(i) <= d'(i), is
addressed by

1. finding the index which [x(1),...,x(i-1)] maps onto (call it
   I), and

2. setting Mi([x(1),...,x(i)]) = N(i-1)·d(i) +
   (I-1)·(d'(i) - d(i)) + x(i) - d(i).

The idea is that the first N(i-1)·d(i) nodes have already been
allocated on the file. To allocate the nodes between d(i) and d'(i)
requires allocating an additional N(i-1)·(d'(i)-d(i)) records at
the end of the file. These nodes must be addressed differently, as
described above. Furthermore, descendants of nodes lying in the
newly allocated area should be allocated. Every descendant of a
type i node which has x(i) in its trace in the range
d(i) < x(i) <= d'(i) will have the same anomaly in its address
calculation. A pairing function which has these kinds of breaks can
be implemented using a decision table.

Figure 3-2 Increasing the Width of a Bounded Rectangular Pairing
Function
(Arrows indicate the layout of the elements on a linear file.)

x* is the maximum value allocated for x when the value of d(i) is
increased to d'(i), that is,

$$ x^{*} = \prod_{j=1}^{i-1} d(j) $$

Given (x,y), calculate Mi(x,y) as follows:

if x <= x* and y <= d(i)          then Mi(x,y) = (x-1)·d(i) + y
if x <= x* and d(i) < y <= d'(i)  then Mi(x,y) = x*·d(i)
                                       + (x-1)·(d'(i)-d(i)) + y - d(i)
if x > x*                         then Mi(x,y) = (x-1)·d'(i) + y

A decision table consists of a set of R rows which correspond to
conditions and a set of C columns which correspond to rules (see
fig. 3-3) [McDaniel 1968]. An entry for a given [condition, rule]
pair in the table comes from the set {"1", "0", "don't care"}. To
"execute" the decision table, the conditions are evaluated to
obtain a vector of R ones and zeroes corresponding to true and
false. The rule is chosen which matches the ones and zeroes of the
vector. The rule, which is implemented as a program, is then
executed.

The decision table can be used to implement a pairing function,
Pi(x,y), for type i. The pairing function uses different
calculations depending on the value of y (i.e., y=x(i), see fig.
3-1c). If we use the conditions to evaluate the range of y, then
choosing the correct computation for the pairing function can be
automated. If d(i) is extended again, another rule and another row
in the decision table for Pi are added automatically.
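The three cases of figure 3-2 can be written as a tiny decision table. The sketch below is one possible encoding, assumed for illustration: conditions are predicates on (x, y), each rule is a pattern of 1, 0 or None ("don't care") paired with an address calculation, and the names `make_extended_pairing`, `x_star`, `d_old` and `d_new` (for x*, d(i) and d'(i)) are hypothetical.

```python
def make_extended_pairing(x_star, d_old, d_new):
    """Build Pi(x, y) after the width grew from d_old to d_new."""
    conditions = [
        lambda x, y: x <= x_star,  # parent was allocated before the increase
        lambda x, y: y <= d_old,   # index lies inside the old width
    ]
    rules = [
        # (pattern over the conditions, address calculation)
        ((1, 1), lambda x, y: (x - 1) * d_old + y),
        ((1, 0), lambda x, y: x_star * d_old + (x - 1) * (d_new - d_old) + y - d_old),
        ((0, None), lambda x, y: (x - 1) * d_new + y),
    ]

    def pair(x, y):
        if not 1 <= y <= d_new:
            raise ValueError("y exceeds the current width; handle as overflow")
        vector = [int(condition(x, y)) for condition in conditions]
        for pattern, action in rules:
            if all(p is None or p == v for p, v in zip(pattern, vector)):
                return action(x, y)
        raise LookupError("no rule matches the condition vector")

    return pair
```

With x*=2, d(i)=2 and d'(i)=3, the four old addresses 1 to 4 are unchanged, the new third children of the old parents take addresses 5 and 6, and a new parent x=3 starts at address 7.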


Figure 3-3 (decision table layout): the rows are the conditions
Cond 1, Cond 2, ..., Cond i, ..., Cond R; the columns are the rules
Rule 1, Rule 2, ..., Rule j, ..., Rule C. Entry i,j can have a
value of "0", "1", or "don't care". If conditions Cond 1, ...,
Cond R, when evaluated, match all the 0's and 1's in some column,
say Rule j, then execute Rule j.

[...] for allocating trees is both dynamic and [...] optimally, its
[...] vary due to the shape of the computed density [...] function
has a very high variance, then d(i) has to be kept low to minimize
fragmentation, while at the same time it has to be kept high so
that a reasonable fraction of the type is within the primary area.
In such a case, any compromise value for d(i) is likely to be bad,
including the optimal value. We hypothesize that uniform
distributions will be uncommon for most types, but the only way to
verify this is to collect data across a large number of data bases.
In any case, the dynamic properties of the scheme may be
sufficiently important to compensate for the cost of wasted space.
It is still probably cheaper than allocating statically and
relocating the whole data base as the tree grows.

4. Conclusion

In this paper we examined two ways of allocating tree structured
data bases without the use of explicit links: namely, hashing and
pairing functions. In the static case where the expected number of
nodes in each type is known a priori, hashing was shown to yield
good results at high loading factors. A cutpoint scheme for pairing
functions was also discussed which guarantees minimal cost.

Any comparison with existing schemes in data base systems is very
difficult. We need to have an idea of the dynamic properties of a
data base. In addition, some statistics are needed about the nature
and frequency of typical queries. We feel nevertheless that the
schemes presented can offer some advantages over some traditionally
used access methods to allocate hierarchical data bases.


Uniform Hashing Algorithm

We want to construct an algorithm which maps traces of the form
[x(1),x(2),...,x(n)], where 1 <= x(i) <= max(i), into addresses in
the range [1,A]. To do this, we first partition the interval [0,A]
into max(1) subintervals; the length of the i-th subinterval
divided by A is proportional to the expected fraction of traces
whose first index is i. Each of these subintervals is then
partitioned into max(2) subintervals according to the same rule,
except this time for type 2. The partitioning process is repeated a
total of n times, so that each final subinterval corresponds to a
trace [x(1),...,x(n)]. That is, traces are mapped 1-1 onto
subintervals, and the union of all subintervals covers [0,A]. We
then map subintervals onto addresses to complete the hash mapping.
This algorithm has the property that the expected fraction of
present traces which map onto each address is 1/A. In addition,
nodes which are closely related (e.g., are in the same filial set
or have common grandparents) are allocated close together in the
address range to improve buffering characteristics. We develop the
algorithm and its correctness proof concurrently.

For an arbitrary k, 1 ≤ k ≤ max(1), we want to calculate the
fraction of type n nodes having traces which begin with x(1)=k.
We will use this result to partition the range [0,A] into max(1)
subintervals, where the length of the k-th subinterval divided by
A will equal the expected fraction of present traces whose first
index is k.

To continue, we know that:

    probability      fraction of traces with x(1)=k

    f1(k-1)          0
    f1(k)            1/k
    f1(k+1)          1/(k+1)
    ...              ...
    f1(max(1))       1/max(1)

So, the total expected fraction of type n traces which have
x(1)=k is:

\[
a(k) = \sum_{j=k}^{\max(1)} \frac{1}{j}\, f_1(j)
\]

In the same way we see that the fraction of type n traces
beginning with x(1)>k is:

\[
b(k) = \sum_{j=k+1}^{\max(1)} \frac{j-k}{j}\, f_1(j)
\]

If the initial interval is [0,A], then the algorithm chooses an
interval, I, for x(1)=k:

I(x(1)=k) = [A(1-a(k)-b(k)), A(1-b(k))]

The length of the interval is A·a(k).
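The first-level partition can be illustrated with a small sketch. It
assumes that f1(j) is the probability that a type 1 filial set has
size j; the distribution, the value of A, and the function names are
hypothetical and chosen only so that the arithmetic is exact.

```python
from fractions import Fraction

def a_b(f, k):
    """Return (a(k), b(k)) for a filial set size distribution f.

    f maps each size j to its probability f(j) (an assumed reading of f1).
    """
    a = sum(Fraction(1, j) * p for j, p in f.items() if j >= k)
    b = sum(Fraction(j - k, j) * p for j, p in f.items() if j > k)
    return a, b

def first_level_interval(f, k, A):
    """Interval chosen for x(1)=k inside [0, A]."""
    a, b = a_b(f, k)
    return A * (1 - a - b), A * (1 - b)

# Hypothetical distribution: sizes 1, 2, 3 with probabilities 1/2, 1/4, 1/4.
f1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
A = 24
intervals = [first_level_interval(f1, k, A) for k in (1, 2, 3)]
for k, (low, high) in zip((1, 2, 3), intervals):
    print(k, low, high)
```

In this example the three intervals are adjacent and together cover
[0,A], with lengths A·a(1), A·a(2) and A·a(3).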

Let I(x(1)=k1, x(2)=k2, ..., x(i)=ki) = [y(i),z(i)], i<n,
be an interval of length z(i)-y(i). The inductive assumption,
p(i), is that (z(i)-y(i))/A equals the expected fraction of
traces with x(1)=k1, x(2)=k2, ..., x(i)=ki. We now calculate
I(x(1)=k1, ..., x(i)=ki, x(i+1)=k) = [y(i+1),z(i+1)] as follows.

The expected fraction of traces with x(i+1)=k,
1 ≤ k ≤ max(i+1), is:

\[
a(k) = \sum_{j=k}^{\max(i+1)} \frac{1}{j}\, f_{i+1}(j)
\]

The fraction of traces with x(i+1)>k is:

\[
b(k) = \sum_{j=k+1}^{\max(i+1)} \frac{j-k}{j}\, f_{i+1}(j)
\]

Choose

y(i+1) = (z(i)-y(i))·(1-a(k)-b(k)) + y(i)

z(i+1) = (z(i)-y(i))·(1-b(k)) + y(i)

We now have an interval [y(i+1),z(i+1)].

We must now show that p(i) => p(i+1).

The length of [y(i+1),z(i+1)] is

\[
\begin{aligned}
z(i+1) - y(i+1)
 &= (z(i)-y(i))\,(1-b(k)) + y(i) - \bigl[(z(i)-y(i))\,(1-a(k)-b(k)) + y(i)\bigr] \\
 &= (z(i)-y(i))\,(1-b(k)) + y(i) - (z(i)-y(i))\,(1-b(k)) + (z(i)-y(i))\,a(k) - y(i) \\
 &= (z(i)-y(i))\,a(k)
\end{aligned}
\]

We want to show (z(i+1)-y(i+1))/A equals the expected fraction of
traces with x(1)=k1, x(2)=k2, ..., x(i)=ki, x(i+1)=k.

(z(i+1)-y(i+1))/A = (z(i)-y(i))·a(k)/A

We know that:

(z(i)-y(i))/A is the fraction of nodes with
x(1)=k1, ..., x(i)=ki.

a(k) is the fraction of nodes of type i+1 which have index k
in their filial set by construction.

Hence, (z(i)-y(i))·a(k)/A is the fraction of nodes with

x(1)=k1, ..., x(i)=ki, x(i+1)=k,

which establishes the induction

p(i) => p(i+1)

The above procedure (see figure I-1) is applied for each
x(i) in the trace. The final interval,
I(x(1)=k1, ..., x(n)=kn) = [y(n),z(n)], has the property that
(z(n)-y(n))/A equals the expected fraction of nodes with
x(1)=k1, ..., x(n)=kn, by induction.
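The repeated refinement can be sketched as follows. This is an
illustration under stated assumptions rather than the original
program: each level is given a hypothetical filial set size
distribution (size mapped to probability), and exact fractions are
used so that interval endpoints can be compared without rounding.

```python
from fractions import Fraction

def trace_interval(trace, dists, A):
    """Refine [0, A] once per trace index using the rule for y(i+1), z(i+1).

    dists[i] is the assumed size distribution for level i+1 of the tree.
    """
    y, z = Fraction(0), Fraction(A)
    for k, f in zip(trace, dists):
        a = sum(Fraction(1, j) * p for j, p in f.items() if j >= k)
        b = sum(Fraction(j - k, j) * p for j, p in f.items() if j > k)
        width = z - y
        # Both new endpoints are computed from the old y before assignment.
        y, z = width * (1 - a - b) + y, width * (1 - b) + y
    return y, z

# Hypothetical two-level tree.
f1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
f2 = {1: Fraction(1, 2), 2: Fraction(1, 2)}
dists = [f1, f2]

print(trace_interval([2, 1], dists, 24))
print(trace_interval([2, 2], dists, 24))
```

For this example the intervals of the traces [2,1] and [2,2] are
adjacent and together fill the first-level interval for x(1)=2, and
the lengths of all six final intervals add up to A.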

We must now map the intervals onto integer addresses. If
the interval [y(n),z(n)] lies completely within some interval
[i-1,i] where 1 ≤ i ≤ A, then we can map [y(n),z(n)] onto i. The
problem arises if [y(n),z(n)] overlaps two ranges, say [i-1,i] and
[i,i+1]. In this case, we will map [y(n),z(n)] onto either i or
i+1, depending upon which unit interval contains more of
[y(n),z(n)]. Of course, we cannot guarantee that the expected
fraction of traces which map onto each address i in A is exactly
1/A, but clearly it will be nearly 1/A on the average.
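The mapping of a final interval onto an address might be written as
below. The function name is hypothetical, and the tie-breaking rule
(the lower address wins when two unit intervals hold equal portions)
is an assumption, since the text does not specify one.

```python
import math

def interval_to_address(y, z, A):
    """Return the address i in 1..A whose unit interval [i-1, i]
    contains the largest portion of [y, z].

    Ties go to the lower address (an assumption of this sketch).
    """
    first = min(max(math.floor(y) + 1, 1), A)
    last = min(max(math.ceil(z), first), A)
    best, best_overlap = first, -1.0
    for i in range(first, last + 1):
        # Length of the intersection of [y, z] with [i-1, i].
        overlap = min(z, i) - max(y, i - 1)
        if overlap > best_overlap:
            best, best_overlap = i, overlap
    return best

print(interval_to_address(2.25, 2.75, 10))  # lies wholly inside [2, 3]
print(interval_to_address(2.75, 3.5, 10))   # straddles [2, 3] and [3, 4]
```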

(It is not necessary to consider the case of [y(n),z(n)]
overlapping more than two intervals. For [y(n),z(n)] to overlap
three intervals, z(n)-y(n) > 1. But this is impossible if A is at
least as large as the expected number of traces which are
present, or, equivalently, that the loading factor is less than
Figure I-1  Uniform Hashing Algorithm


    high = A
    low = 0
    oldlow = 0
    DO i = 1 TO n
        k = x(i)
        a = SUM for j = k TO max(i) of (1/j)·fi(j)
        b = SUM for j = k+1 TO max(i) of ((j-k)/j)·fi(j)
        low = (high - oldlow)·(1 - a - b) + oldlow
        high = (high - oldlow)·(1 - b) + oldlow
        oldlow = low
    END

Map the interval onto the integer i whose associated
interval, [i-1,i], contains the largest portion of
[low,high].
APPENDIX II


Filial Set Distribution Data

The accompanying figures describe the filial set size
distribution for a large IBM/IMS data base containing about
155000 root nodes and about 1.8 million nodes altogether. The
definition tree for the data base is described in figure II-1,
and the filial set size distributions for each type are graphed
in figure II-2.
APPENDIX III


Proof of Lemma from Page 39

We  want  to  prove  the  following  lemma. 

Lemma  If d(j)=s(j) for 1 ≤ j ≤ i-1, where each s(j) is a known
positive integer constant, minimizes eq. 3-1 for type i-1,
then those values for d(j) also minimize eq. 3-1 for
type i.

Proof 

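Recall eq. 3-1 is

\[
\mathrm{TOT} = \mathrm{K1}\cdot(\mathrm{NAP} + \mathrm{OV})
  + \mathrm{K2}\cdot\mathrm{OV}\cdot
    \frac{\mathrm{IH}\cdot\mathrm{POV}\cdot\mathrm{LF}}{\mathrm{NP}\cdot\mathrm{EOV}}
\]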

As far as the choice of the d(j)'s is concerned, the following
variables may be considered to be constant: K1, K2, IH, LF, and
EOV. We can therefore rewrite TOT as

\[
\mathrm{TOT} = a\cdot\mathrm{NAP} + \mathrm{OV}\cdot(a + b\cdot\mathrm{POV})
\]

where a = K1

      b = (K2·IH·LF)/(NP·EOV)

Using OV=POV+EOV, we can further simplify TOT to

\[
\mathrm{TOT} = a\cdot\mathrm{NAP} + c + d\cdot\mathrm{POV}
  + \mathrm{POV}\cdot(a + b\cdot\mathrm{POV})
\]

where c = a·EOV
      d = b·EOV

Thus,  there  are  only  two  variables  to  optimize  over,  namely,  POV 
and  NAP. 
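The two rewritings of TOT can be checked numerically. The sketch
below treats the symbols of eq. 3-1 as opaque positive numbers; the
values are made up, the function names are hypothetical, and the
spellings IH and NP follow the reading of the equation adopted here.

```python
from fractions import Fraction as F

def tot_original(K1, K2, IH, LF, NP, EOV, NAP, POV):
    """eq. 3-1 with OV expanded as POV + EOV."""
    OV = POV + EOV
    return K1 * (NAP + OV) + K2 * OV * (IH * POV * LF) / (NP * EOV)

def tot_rewritten(K1, K2, IH, LF, NP, EOV, NAP, POV):
    """The simplified form in terms of the constants a, b, c, d."""
    a = K1
    b = (K2 * IH * LF) / (NP * EOV)
    c = a * EOV
    d = b * EOV
    return a * NAP + c + d * POV + POV * (a + b * POV)

# Arbitrary illustrative values, exact rationals to avoid rounding.
vals = (F(2), F(3), F(5), F(1, 2), F(7), F(4), F(6), F(9))
print(tot_original(*vals) == tot_rewritten(*vals))
```

Only NAP and POV depend on the d(j)'s in the rewritten form, which is
why the optimization reduces to those two quantities.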
\[
\mathrm{POV} = \prod_{j=1}^{i} \sum_{k=1}^{\max(j)} k\, f_j(k)
  \;-\; \prod_{j=1}^{i} \Bigl[\, d(j) \sum_{k=d(j)}^{\max(j)} f_j(k)
  + \sum_{k=1}^{d(j)-1} k\, f_j(k) \Bigr]
\]

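We now separate the terms containing d(i) from the rest of the
expressions.

\[
\mathrm{NAP} = d(i) \cdot \prod_{j=1}^{i-1} d(j)
\]

\[
\begin{aligned}
\mathrm{POV} ={}& \Bigl( \sum_{k=1}^{\max(i)} k\, f_i(k) \Bigr) \cdot
   \Bigl( \prod_{j=1}^{i-1} \sum_{k=1}^{\max(j)} k\, f_j(k) \Bigr) \\
 &- \Bigl( d(i) \sum_{k=d(i)}^{\max(i)} f_i(k)
   + \sum_{k=1}^{d(i)-1} k\, f_i(k) \Bigr) \cdot
   \Bigl( \prod_{j=1}^{i-1} \Bigl[\, d(j) \sum_{k=d(j)}^{\max(j)} f_j(k)
   + \sum_{k=1}^{d(j)-1} k\, f_j(k) \Bigr] \Bigr)
\end{aligned}
\qquad \text{(eq. III-1)}
\]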


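We can easily show that

\[
\sum_{k=1}^{\max(i)} k\, f_i(k) \;\ge\;
  d(i) \sum_{k=d(i)}^{\max(i)} f_i(k) + \sum_{k=1}^{d(i)-1} k\, f_i(k)
\qquad \text{(eq. III-2)}
\]

from the fact that

\[
\sum_{k=1}^{\max(i)} k\, f_i(k) =
  \sum_{k=d(i)}^{\max(i)} k\, f_i(k) + \sum_{k=1}^{d(i)-1} k\, f_i(k)
\]

and that k ≥ d(i) in every term of the first sum on the right.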
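A quick numerical check of eq. III-2, using a hypothetical filial set
size distribution (size mapped to probability) and hypothetical
function names, is sketched below.

```python
from fractions import Fraction

def full_sum(f):
    """Left side of eq. III-2: sum of k*f(k) over all sizes k."""
    return sum(k * p for k, p in f.items())

def capped_sum(f, d):
    """Right side of eq. III-2 for cutoff d."""
    tail = d * sum(p for k, p in f.items() if k >= d)
    head = sum(k * p for k, p in f.items() if k < d)
    return tail + head

# Hypothetical distribution: sizes 1, 2, 3 with probabilities 1/2, 1/4, 1/4.
f = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
for d in range(1, 5):
    print(d, capped_sum(f, d), full_sum(f))
```

In this example the right side grows with d and reaches the left side
once d is at least the largest size present.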


From our hypothesis and the fact that all the factors of POV
are positive, we know that the expression,

\[
\mathrm{exp} = \prod_{j=1}^{i-1} \sum_{k=1}^{\max(j)} k\, f_j(k)
  \;-\; \prod_{j=1}^{i-1} \Bigl[\, d(j) \sum_{k=d(j)}^{\max(j)} f_j(k)
  + \sum_{k=1}^{d(j)-1} k\, f_j(k) \Bigr]
\]

is minimal for d(j)=s(j). When we rewrite this expression for
type i, we get eq. III-1. From eq. III-1 and III-2 we get

\[
\mathrm{exp} \le \mathrm{POV}
\]


Looking at the equation for TOT, we see that extending it from
type i-1 to type i increases the value of every term. Furthermore,
the choice of d(i) is independent of any of the old values
of d(j), 1 ≤ j ≤ i-1. If d(j)=s(j), 1 ≤ j ≤ i-1, minimizes TOT for
type i-1, then they must also be the best values of these variables
for type i, thus proving the lemma.